Project: Create Optimal Hotel Recommendations

Author: Robert Zacchigna

Table of Contents

  • Problem Statement
  • Proposal
  • Dataset - Expedia Hotel Recommendations
  • Imports
  • Part 1: Exploratory Data Analysis
  • Part 2: Data Preprocessing and Feature Reduction
  • Part 3: Model Evaluation and Selection
  • Part 4: Final Model and Analysis Results

Problem Statement

Online travel agencies are scrambling to meet the artificial-intelligence-driven personalization standard set by companies like Amazon and Netflix. In addition, online travel has become a highly competitive space where brands try to capture our attention (and wallets) by recommending, comparing, matching, and sharing.

Proposal

Create optimal hotel recommendations for Expedia users who are searching for a hotel to book; specifically, predict which "hotel cluster" a user is likely to book, given their search details.

Split train.csv into a training and test set (feel free to select a smaller random subset of train.csv). There is another file named destinations.csv, which contains latent features derived from hotel reviews made by users, describing each search destination. Then, build at least two prediction models from the training set, and report their accuracies on the test set.

Dataset - Expedia Hotel Recommendations

Download Location: https://www.kaggle.com/c/expedia-hotel-recommendations/data

Columns:

  • date_time – Timestamp
  • site_name – ID of the Expedia point of sale (i.e. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...)
  • posa_continent – ID of continent associated with site_name
  • user_location_country – The ID of the country the customer is located in
  • user_location_region – The ID of the region the customer is located in
  • user_location_city – The ID of the city the customer is located in
  • orig_destination_distance – Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated
  • user_id – ID of user
  • is_mobile – 1 when a user connected from a mobile device, 0 otherwise
  • is_package – 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise
  • channel – ID of a marketing channel
  • srch_ci – Checkin date
  • srch_co – Checkout date
  • srch_adults_cnt – The number of adults specified in the hotel room
  • srch_children_cnt – The number of (extra occupancy) children specified in the hotel room
  • srch_rm_cnt – The number of hotel rooms specified in the search
  • srch_destination_id – ID of the destination where the hotel search was performed
  • srch_destination_type_id – Type of destination
  • hotel_continent – Hotel continent
  • hotel_country – Hotel country
  • hotel_market – Hotel market
  • is_booking – 1 if a booking, 0 if a click
  • cnt – Number of similar events in the context of the same user session
  • hotel_cluster – ID of a hotel cluster
  • srch_destination_id – ID of the destination where the hotel search was performed (join key for destinations.csv)
  • d1-d149 – latent description of search regions (from destinations.csv)

Imports

In [1]:
import random
import numpy as np
import pandas as pd
import seaborn as sb
import datetime as dt
import pandas_profiling as pp

from scipy.stats import norm
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import train_test_split, GridSearchCV

Part 1: Exploratory Data Analysis

Randomly Sample Dataset (75,000 records) and Drop all Missing (NaN) Values

In [2]:
seed = 42 # set seed

filename = 'Expedia_Hotel_Data/train.csv'

n = sum(1 for line in open(filename)) - 1 # total number of records (minus the header)
s = 75000 # desired sample size

random.seed(seed)
skip = sorted(random.sample(range(1, n + 1), n - s)) # rows to skip so that s rows remain

# Read train.csv and drop all missing (NaN) values
hotelData = pd.read_csv(filename, skiprows=skip).dropna().reset_index(drop=True)
In [3]:
print('Dataset shape: {:,} columns and {:,} rows'.format(hotelData.shape[1], hotelData.shape[0]))
Dataset shape: 24 columns and 47,981 rows
In [4]:
hotelData.head()
Out[4]:
date_time site_name posa_continent user_location_country user_location_region user_location_city orig_destination_distance user_id is_mobile is_package ... srch_children_cnt srch_rm_cnt srch_destination_id srch_destination_type_id is_booking cnt hotel_continent hotel_country hotel_market hotel_cluster
0 2014-08-11 08:24:33 2 3 66 348 48862 2234.2641 12 0 0 ... 0 1 8250 1 0 1 2 50 628 1
1 2014-09-22 14:13:18 2 3 66 153 50542 761.0443 15632 0 1 ... 1 1 8805 1 0 1 4 8 120 52
2 2014-12-28 12:36:25 2 3 66 220 43026 4703.5435 23234 0 0 ... 0 1 8859 1 0 1 2 50 212 55
3 2014-08-13 13:41:21 2 3 66 348 48862 341.2798 23569 0 0 ... 0 1 8287 1 0 2 2 198 397 9
4 2014-06-12 07:29:47 11 3 205 354 43201 45.6136 28560 0 0 ... 0 1 12951 5 0 1 2 198 397 33

5 rows × 24 columns
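
As a quick sanity check (a minimal sketch, not part of the original run), we can confirm that no missing values survived the dropna() call above:

In [ ]:
# Sanity check: dropna() above should have removed every missing value
assert hotelData.isna().sum().sum() == 0
print('No missing values remain: {:,} rows retained out of {:,} sampled'.format(hotelData.shape[0], s))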

Merge destinations.csv with Sampled train.csv Data (on "srch_destination_id" Column) and Drop All Missing (NaN) Values

In [6]:
destData = hotelData.merge(pd.read_csv('expedia-hotel-recommendations/destinations.csv'), 
                           how='left', on='srch_destination_id').dropna().reset_index(drop=True)

Move Target Variable (hotel_cluster) to the Front of the Dataset

In [ ]:
tmp = destData['hotel_cluster']

destData = destData.drop(['hotel_cluster'], axis=1)
destData.insert(0, 'hotel_cluster', tmp)
In [7]:
print('Merged Destination Dataset shape: {:,} columns and {:,} rows'.format(destData.shape[1], destData.shape[0]))
Merged Destination Dataset shape: 173 columns and 47,751 rows
In [8]:
destData.head()
Out[8]:
hotel_cluster date_time site_name posa_continent user_location_country user_location_region user_location_city orig_destination_distance user_id is_mobile ... d140 d141 d142 d143 d144 d145 d146 d147 d148 d149
0 1 2014-08-11 08:24:33 2 3 66 348 48862 2234.2641 12 0 ... -2.384553 -2.345528 -2.396591 -2.399953 -2.388116 -2.394294 -2.400667 -2.398716 -2.386585 -2.390370
1 52 2014-09-22 14:13:18 2 3 66 153 50542 761.0443 15632 0 ... -2.202278 -2.266226 -2.219381 -2.174108 -2.219381 -2.242218 -2.276187 -2.276185 -2.175651 -2.276187
2 55 2014-12-28 12:36:25 2 3 66 220 43026 4703.5435 23234 0 ... -2.323738 -2.250667 -2.299810 -2.144616 -2.193264 -2.233139 -2.351674 -2.338092 -2.351674 -2.116166
3 9 2014-08-13 13:41:21 2 3 66 348 48862 341.2798 23569 0 ... -2.293244 -2.239556 -2.293315 -2.294890 -2.294890 -2.293244 -2.294888 -2.294174 -2.294890 -2.220201
4 33 2014-06-12 07:29:47 11 3 205 354 43201 45.6136 28560 0 ... -2.250248 -2.212484 -2.250323 -2.251294 -2.251294 -2.250248 -2.251293 -2.250853 -2.251294 -2.209588

5 rows × 173 columns

Balance the Dataset so that the Target Variable has Equal Records for Each Hotel Cluster

In [9]:
balGroups = destData.groupby('hotel_cluster')
minSize = balGroups.size().min() # size of the smallest hotel cluster

# Downsample every cluster to the size of the smallest one
balData = (balGroups.apply(lambda x: x.sample(minSize).reset_index(drop=True))
                    .droplevel('hotel_cluster')
                    .reset_index(drop=True))
In [10]:
print('Merged Balanced Dataset shape: {:,} columns and {:,} rows'.format(balData.shape[1], balData.shape[0]))
Merged Balanced Dataset shape: 173 columns and 6,500 rows
In [11]:
balData.head()
Out[11]:
hotel_cluster date_time site_name posa_continent user_location_country user_location_region user_location_city orig_destination_distance user_id is_mobile ... d140 d141 d142 d143 d144 d145 d146 d147 d148 d149
0 0 2014-08-07 17:45:07 2 3 66 363 43253 4674.7903 912617 1 ... -2.318421 -2.044013 -2.258347 -2.166991 -2.282826 -2.298343 -2.324200 -2.304206 -2.317755 -2.283796
1 0 2014-01-25 15:59:37 11 3 205 135 56436 3391.4914 602160 0 ... -2.311260 -2.086433 -2.296965 -2.224663 -2.340364 -2.257228 -2.340607 -2.329235 -2.340607 -2.099513
2 0 2014-12-28 01:24:12 2 3 66 462 49272 1061.1388 79679 0 ... -2.253602 -2.281995 -2.285591 -2.196569 -2.237026 -2.256944 -2.254439 -2.279294 -2.290197 -2.031203
3 0 2014-12-12 20:57:39 2 3 66 442 47582 2168.1149 316597 0 ... -2.263954 -2.252633 -2.281117 -2.307805 -2.313573 -2.279911 -2.313964 -2.306305 -2.235204 -2.311443
4 0 2014-05-14 18:00:14 2 3 66 174 31320 2577.7681 97317 0 ... -2.323738 -2.250667 -2.299810 -2.144616 -2.193264 -2.233139 -2.351674 -2.338092 -2.351674 -2.116166

5 rows × 173 columns
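
To verify the balancing worked, each of the 100 clusters should now contain the same number of records (65 apiece, given the 6,500 rows above); a minimal check:

In [ ]:
# Every hotel_cluster should appear the same number of times after balancing
counts = balData['hotel_cluster'].value_counts()
print('Clusters: {}, records per cluster: {}'.format(counts.size, counts.unique()))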

Pandas Profiling Report of the Final Balanced Dataset (Without Destination Columns)

In [97]:
pp.ProfileReport(balData[balData.columns[:24]]).to_notebook_iframe()



Part 2: Data Preprocessing and Feature Reduction

Convert Date and Time Columns from Strings to Datetime dtypes, then to Datetime Ordinals

In [12]:
for col in ['srch_ci', 'srch_co']:
    balData[col] = pd.to_datetime(balData[col], format = '%Y-%m-%d')
    balData[col] = balData[col].map(dt.datetime.toordinal)

balData['date_time'] = pd.to_datetime(balData['date_time'], format = '%Y-%m-%dT%H:%M:%S')
balData['date_time'] = balData['date_time'].map(dt.datetime.toordinal)

New Data Types for Each of the Datetime Columns

In [13]:
for col in ['date_time', 'srch_ci', 'srch_co']:
    print('Column: ' + col + ', Type: ' + str(type(balData[col][0])))
Column: date_time, Type: <class 'numpy.int64'>
Column: srch_ci, Type: <class 'numpy.int64'>
Column: srch_co, Type: <class 'numpy.int64'>
In [14]:
balData[['date_time', 'srch_ci', 'srch_co']]
Out[14]:
date_time srch_ci srch_co
0 735452 735465 735472
1 735258 735291 735298
2 735595 735628 735630
3 735579 735620 735630
4 735367 735367 735370
... ... ... ...
6495 735070 735298 735300
6496 735314 735346 735350
6497 735089 735156 735159
6498 735433 735454 735455
6499 734933 735013 735014

6500 rows × 3 columns
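
Ordinals can be mapped back to calendar dates for interpretation; for example (a minimal sketch), the first date_time ordinal above recovers the date of the original timestamp:

In [ ]:
# Convert an ordinal back to a date; 735452 is the first date_time value above
print(dt.date.fromordinal(735452)) # 2014-08-07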

Use PCA (Principal Component Analysis) to Reduce the Number of Destination Columns in the Dataset

I will be using the PCA feature-reduction technique to reduce the number of columns in the dataset, specifically the destination columns, down to a more manageable amount for training models.

In [15]:
fig = plt.figure()
fig.set_size_inches(15, 12)
sb.set(font_scale = 1.25)

N_COMPONENTS = len(balData.columns[24:])

pca = PCA(n_components = N_COMPONENTS)

pc_matrix = pca.fit_transform(balData[balData.columns[24:]])

evr = pca.explained_variance_ratio_ * 100
cumsum_evr = np.cumsum(evr)

tickMarks = 10

ax = sb.lineplot(x=np.arange(1, len(cumsum_evr) + 1), y=cumsum_evr, label='Explained Variance Ratio')
ax.lines[0].set_linestyle('-.')
ax.set_title('Explained Variance Ratio Using {} Components'.format(N_COMPONENTS))
ax.plot(np.arange(1, len(cumsum_evr) + 1), cumsum_evr, 'bo')

for x, y in zip(range(1, len(cumsum_evr) + 1), cumsum_evr):
    if x in np.arange(1, 8, 1) or x % tickMarks == 0:
        plt.annotate("{:.2f}%".format(y), (x, y), xytext=(2, -15), 
                     textcoords="offset points", annotation_clip = False)

ax = sb.lineplot(x=np.arange(1, len(cumsum_evr) + 1), y=evr, label='Explained Variance Of Component X')
ax.plot(np.arange(1, len(evr) + 1), evr,'ro')
ax.lines[1].set_linestyle('-.')
ax.set_xticks([i for i in range(1, len(cumsum_evr) + 1) if i in np.arange(1, 2, 1) or i % tickMarks == 0])

for x, y in zip(range(1, len(cumsum_evr) + 1), evr):
    if x != 1 and (x in np.arange(1, 5, 1) or x % tickMarks == 0):
        plt.annotate("{:.2f}%".format(y), (x, y), xytext=(2, 5), 
                     textcoords="offset points", annotation_clip = False)

ax.set_xlabel('Component Number')
ax.set_ylabel('Explained Variance')
Out[15]:
Text(0, 0.5, 'Explained Variance')

From the graph above, we can see that of the 149 total components, the first 10 alone explain almost 81% of the variance in the destination columns. As a result, I will be using the first 10 components to represent the destination column data.
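
The ~81% figure can also be read directly from the cumulative explained variance array computed in the plotting cell above (a quick check using the cumsum_evr variable):

In [ ]:
# Cumulative explained variance captured by the first 10 principal components
print('First 10 components explain {:.2f}% of the destination-column variance'.format(cumsum_evr[9]))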

Drop Destination Columns from the Dataset and Append the First 10 PCA Components

In [16]:
balData = balData[balData.columns[:24]]

bal_PCA_Data = pd.concat([balData, 
                          pd.DataFrame(pc_matrix, columns=['PC-{}'.format(i) for i in range(1, N_COMPONENTS + 1)])], axis=1)

bal_PCA_Data = bal_PCA_Data[bal_PCA_Data.columns[:34]]
In [17]:
print('PCA Balanced Dataset shape: {:,} columns and {:,} rows'.format(bal_PCA_Data.shape[1], bal_PCA_Data.shape[0]))
PCA Balanced Dataset shape: 34 columns and 6,500 rows
In [18]:
bal_PCA_Data.head()
Out[18]:
hotel_cluster date_time site_name posa_continent user_location_country user_location_region user_location_city orig_destination_distance user_id is_mobile ... PC-1 PC-2 PC-3 PC-4 PC-5 PC-6 PC-7 PC-8 PC-9 PC-10
0 0 735452 2 3 66 363 43253 4674.7903 912617 1 ... 0.038359 0.717690 -0.276927 0.574149 0.199952 0.266928 0.041451 -0.423770 0.021183 0.141342
1 0 735258 11 3 205 135 56436 3391.4914 602160 0 ... 0.301151 0.744871 -0.574808 0.666642 0.107351 0.158885 0.008331 -0.096678 0.047731 -0.078489
2 0 735595 2 3 66 462 49272 1061.1388 79679 0 ... 0.039220 0.266363 -0.237070 0.350676 0.163222 -0.174890 -0.027802 -0.017746 -0.036956 0.049731
3 0 735579 2 3 66 442 47582 2168.1149 316597 0 ... 0.275797 0.738823 -0.096367 0.426367 -0.107174 0.072494 0.005108 0.003337 -0.058467 0.147615
4 0 735367 2 3 66 174 31320 2577.7681 97317 0 ... 0.263846 0.841132 -0.483993 0.617534 0.179581 0.131722 0.094535 -0.147429 -0.087810 -0.156483

5 rows × 34 columns

In [19]:
bal_PCA_Data.describe()
Out[19]:
hotel_cluster date_time site_name posa_continent user_location_country user_location_region user_location_city orig_destination_distance user_id is_mobile ... PC-1 PC-2 PC-3 PC-4 PC-5 PC-6 PC-7 PC-8 PC-9 PC-10
count 6500.000000 6500.000000 6500.000000 6500.000000 6500.000000 6500.000000 6500.000000 6500.000000 6.500000e+03 6500.000000 ... 6.500000e+03 6.500000e+03 6.500000e+03 6.500000e+03 6.500000e+03 6.500000e+03 6.500000e+03 6.500000e+03 6.500000e+03 6.500000e+03
mean 49.500000 735324.007385 6.259231 2.892308 86.548000 312.011385 27771.486308 2231.990799 6.021436e+05 0.132462 ... -1.246183e-16 4.022765e-16 -1.300840e-16 -2.361188e-16 1.997718e-16 -8.963770e-17 3.235702e-16 -3.733082e-16 1.803685e-16 -1.945794e-16
std 28.868291 204.931858 8.899831 0.580577 54.470165 153.420377 16684.656493 2416.008474 3.415110e+05 0.339018 ... 6.878449e-01 5.027686e-01 3.318054e-01 2.059667e-01 1.949011e-01 1.755212e-01 1.489367e-01 1.383939e-01 1.257084e-01 1.231231e-01
min 0.000000 734875.000000 2.000000 0.000000 0.000000 135.000000 3.000000 0.050000 4.000000e+01 0.000000 ... -1.166558e+00 -7.828075e-01 -1.010672e+00 -5.923705e-01 -5.581819e-01 -4.655405e-01 -3.664375e-01 -5.178653e-01 -3.246116e-01 -4.286941e-01
25% 24.750000 735165.000000 2.000000 3.000000 66.000000 174.000000 13951.000000 352.902175 3.134652e+05 0.000000 ... -4.681122e-01 -3.486654e-01 -1.986688e-01 -1.140999e-01 -1.458131e-01 -1.146889e-01 -8.285172e-02 -7.098814e-02 -7.816345e-02 -5.694907e-02
50% 49.500000 735378.000000 2.000000 3.000000 66.000000 318.000000 27117.000000 1303.950450 6.002745e+05 0.000000 ... -1.389363e-01 -1.656830e-01 8.150757e-03 -2.773206e-02 -3.935751e-02 1.865249e-03 -2.881817e-03 -1.116574e-03 -1.644848e-02 1.095536e-03
75% 74.250000 735491.000000 8.000000 3.000000 66.000000 363.000000 42904.750000 3452.731175 8.972692e+05 0.000000 ... 2.638461e-01 2.663628e-01 2.265778e-01 1.351711e-01 1.094830e-01 1.077041e-01 6.654613e-02 8.520337e-02 7.095119e-02 6.084757e-02
max 99.000000 735598.000000 40.000000 4.000000 215.000000 1017.000000 56498.000000 11633.559500 1.198685e+06 1.000000 ... 2.042211e+00 1.336869e+00 1.047695e+00 6.666419e-01 7.933264e-01 7.877882e-01 8.462339e-01 5.280895e-01 6.852688e-01 6.620767e-01

8 rows × 34 columns

Annotated Correlation Matrix of Balanced Data with PCA Components

In [20]:
fig = plt.figure()
fig.set_size_inches(20, 15)
sb.set(font_scale = 0.8)

sb.heatmap(bal_PCA_Data.corr('pearson'), annot=True)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bfba8c1a30>

Part 3: Model Evaluation and Selection

The models I have selected to experiment with in this analysis are the Random Forest Classifier and the Decision Tree Classifier. The models' performances (accuracy scores) on the training data will be compared at the end to see which model performed best, and the best model will then be used as the final model for predicting on the test set.

Split Balanced PCA Dataset into Train and Test Sets

In [21]:
x_train, x_test, y_train, y_test = train_test_split(bal_PCA_Data[bal_PCA_Data.columns[1:]],
                                                    bal_PCA_Data['hotel_cluster'],
                                                    train_size=0.65,
                                                    random_state=seed)

X_Train Set

In [22]:
print('x_train shape: {:,} columns and {:,} rows'.format(x_train.shape[1], x_train.shape[0]))
x_train shape: 33 columns and 4,225 rows
In [23]:
x_train.head()
Out[23]:
date_time site_name posa_continent user_location_country user_location_region user_location_city orig_destination_distance user_id is_mobile is_package ... PC-1 PC-2 PC-3 PC-4 PC-5 PC-6 PC-7 PC-8 PC-9 PC-10
3450 735590 8 4 77 977 53780 4922.0637 460864 0 0 ... -0.702593 -0.127470 0.150997 -0.050624 -0.117820 0.237877 0.071726 -0.105034 0.115129 -0.149392
3142 734914 2 3 66 315 3715 64.1402 51204 0 0 ... -0.898255 -0.122136 0.267205 0.019504 -0.090074 0.072747 0.114866 0.113533 -0.090878 0.159947
1030 735253 2 3 66 448 53078 430.6098 165691 0 0 ... -0.158677 -0.217689 -0.184908 -0.003483 -0.000764 0.065004 -0.044461 -0.100768 0.341342 0.305425
6342 735052 2 3 66 174 16634 2383.6284 324109 0 1 ... 0.592609 1.078569 -0.107937 -0.421634 -0.027001 0.038146 -0.094263 0.087668 0.099357 -0.065189
300 735217 2 3 66 348 18146 1185.8921 1037393 0 1 ... 0.207330 0.536730 -0.198669 0.183831 0.118125 -0.333120 -0.218281 0.178284 0.092877 -0.013112

5 rows × 33 columns

In [24]:
x_train.describe()
Out[24]:
date_time site_name posa_continent user_location_country user_location_region user_location_city orig_destination_distance user_id is_mobile is_package ... PC-1 PC-2 PC-3 PC-4 PC-5 PC-6 PC-7 PC-8 PC-9 PC-10
count 4225.000000 4225.000000 4225.000000 4225.000000 4225.000000 4225.000000 4225.000000 4.225000e+03 4225.000000 4225.000000 ... 4225.000000 4225.000000 4225.000000 4225.000000 4225.000000 4225.000000 4225.000000 4225.000000 4225.000000 4225.000000
mean 735321.946982 6.161420 2.899408 85.852781 310.261065 27899.190296 2245.810682 5.989017e+05 0.133018 0.255148 ... -0.007904 0.001486 -0.001139 0.000904 0.000370 0.004077 -0.001423 -0.000178 0.000043 0.001363
std 205.936301 8.766812 0.571632 53.927132 154.391940 16788.995787 2426.708839 3.409153e+05 0.339634 0.435996 ... 0.680497 0.502323 0.331297 0.204682 0.194835 0.179355 0.147578 0.139207 0.128494 0.125869
min 734875.000000 2.000000 0.000000 0.000000 135.000000 40.000000 0.050000 1.186000e+03 0.000000 0.000000 ... -1.146758 -0.782807 -1.010672 -0.592371 -0.558182 -0.465540 -0.358568 -0.517865 -0.324612 -0.428694
25% 735157.000000 2.000000 3.000000 66.000000 174.000000 13951.000000 354.849800 3.126530e+05 0.000000 0.000000 ... -0.468927 -0.347628 -0.199449 -0.113636 -0.145813 -0.112506 -0.082852 -0.070391 -0.081343 -0.056706
50% 735375.000000 2.000000 3.000000 66.000000 315.000000 27251.000000 1327.180100 5.919250e+05 0.000000 0.000000 ... -0.145545 -0.165683 0.008151 -0.027383 -0.039272 0.008822 -0.003340 -0.000735 -0.019775 0.001861
75% 735490.000000 8.000000 3.000000 66.000000 363.000000 43264.000000 3447.943600 8.949280e+05 0.000000 1.000000 ... 0.257810 0.266363 0.227685 0.133614 0.107351 0.113925 0.066100 0.085650 0.068691 0.064014
max 735598.000000 40.000000 4.000000 215.000000 1017.000000 56498.000000 11623.257100 1.198685e+06 1.000000 1.000000 ... 2.042211 1.336869 1.047695 0.666642 0.793326 0.787788 0.846234 0.528090 0.685269 0.662077

8 rows × 33 columns

Y_Train Set

In [25]:
print('y_train shape: 1 column and {:,} rows'.format(y_train.shape[0]))
y_train shape: 1 column and 4,225 rows
In [26]:
y_train.head()
Out[26]:
3450    53
3142    48
1030    15
6342    97
300      4
Name: hotel_cluster, dtype: int64
In [27]:
y_train.describe()
Out[27]:
count    4225.000000
mean       49.943905
std        28.717438
min         0.000000
25%        25.000000
50%        50.000000
75%        75.000000
max        99.000000
Name: hotel_cluster, dtype: float64

Conduct RandomForest Classifier Modeling

Create Pipeline for Scaling and Running RandomForest Classification Modeling

In [28]:
rf_pipe = Pipeline(steps=([
    ('scale', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=seed))
]))

Setup Parameters for RandomForest Classification Model to be Tested by GridSearchCV

In [29]:
param_grid = {'rf__max_depth': [2, 4, 6],
              'rf__class_weight': ['balanced', 'balanced_subsample'],
              'rf__criterion': ['gini', 'entropy'],
              'rf__max_features': ['auto', 'sqrt', 'log2'],
              'rf__min_samples_leaf': [2, 3],
              'rf__min_samples_split': [2, 3],
              'rf__n_estimators': [100, 200]}

rf_grid = GridSearchCV(rf_pipe, scoring=make_scorer(accuracy_score), 
                       param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)

Fit Data to RandomForest Grid to Find the Best Parameters for the RandomForest Classification Model

In [30]:
rf_grid.fit(x_train, y_train)
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    9.0s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   32.9s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:  9.6min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed: 14.9min
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 27.9min finished
Out[30]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('rf',
                                        RandomForestClassifier(random_state=42))]),
             n_jobs=-1,
             param_grid={'rf__class_weight': ['balanced', 'balanced_subsample'],
                         'rf__criterion': ['gini', 'entropy'],
                         'rf__max_depth': [2, 4, 6],
                         'rf__max_features': ['auto', 'sqrt', 'log2'],
                         'rf__min_samples_leaf': [2, 3],
                         'rf__min_samples_split': [2, 3],
                         'rf__n_estimators': [100, 200]},
             scoring=make_scorer(accuracy_score), verbose=2)

Display Top Accuracy Scores Found by GridSearchCV

In [31]:
rf_df = pd.DataFrame(rf_grid.cv_results_).sort_values('mean_test_score', 
                                                      ascending=False)[['params', 'mean_test_score']].head(10)
rf_df
Out[31]:
params mean_test_score
135 {'rf__class_weight': 'balanced', 'rf__criterio... 0.110296
143 {'rf__class_weight': 'balanced', 'rf__criterio... 0.110296
125 {'rf__class_weight': 'balanced', 'rf__criterio... 0.110296
127 {'rf__class_weight': 'balanced', 'rf__criterio... 0.110296
133 {'rf__class_weight': 'balanced', 'rf__criterio... 0.110296
141 {'rf__class_weight': 'balanced', 'rf__criterio... 0.110296
123 {'rf__class_weight': 'balanced', 'rf__criterio... 0.109112
129 {'rf__class_weight': 'balanced', 'rf__criterio... 0.109112
131 {'rf__class_weight': 'balanced', 'rf__criterio... 0.109112
137 {'rf__class_weight': 'balanced', 'rf__criterio... 0.109112

From the Dataframe Above, Display the Best Params and Score for the RandomForest Classification Model

In [33]:
print('Best Random Forest Classification Parameters\n' + '='*44)

for name, val in rf_df.iloc[0]['params'].items():
    print('{:>24}: {}'.format(name.replace('rf__', ''), val))
    
rf_acc = rf_df.iloc[0]['mean_test_score']
print('\nAccuracy Score: {}'.format(round(rf_acc, 4)))
Best Random Forest Classification Parameters
============================================
            class_weight: balanced
               criterion: entropy
               max_depth: 6
            max_features: sqrt
        min_samples_leaf: 3
       min_samples_split: 3
            n_estimators: 200

Accuracy Score: 0.1103

Conduct DecisionTree Classification Modeling

Create Pipeline for Scaling and Running DecisionTree Classification Modeling

In [40]:
dt_pipe = Pipeline(steps=([
    ('scale', StandardScaler()),
    ('dt', DecisionTreeClassifier(random_state=seed))
]))

Setup Parameters for DecisionTree Classifier Model to be Tested by GridSearchCV

In [48]:
param_grid = {'dt__criterion': ['gini', 'entropy'],
              'dt__class_weight': ['balanced', None],
              'dt__splitter': ['best', 'random'],
              'dt__max_features': ['auto', 'sqrt', 'log2'],
              'dt__max_depth': [2, 4, 6],
              'dt__min_samples_leaf': [1, 2, 4],
              'dt__min_samples_split': [2, 4, 6]}

dt_grid = GridSearchCV(dt_pipe, scoring=make_scorer(accuracy_score), 
                       param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)

Fit Data to DecisionTree Grid to Find the Best Parameters for the DecisionTree Classification Model

In [49]:
dt_grid.fit(x_train, y_train)
Fitting 5 folds for each of 648 candidates, totalling 3240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 504 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done 1316 tasks      | elapsed:   13.9s
[Parallel(n_jobs=-1)]: Done 2448 tasks      | elapsed:   25.7s
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed:   38.4s finished
Out[49]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('dt',
                                        DecisionTreeClassifier(random_state=42))]),
             n_jobs=-1,
             param_grid={'dt__class_weight': ['balanced', None],
                         'dt__criterion': ['gini', 'entropy'],
                         'dt__max_depth': [2, 4, 6],
                         'dt__max_features': ['auto', 'sqrt', 'log2'],
                         'dt__min_samples_leaf': [1, 2, 4],
                         'dt__min_samples_split': [2, 4, 6],
                         'dt__splitter': ['best', 'random']},
             scoring=make_scorer(accuracy_score), verbose=2)

Display Top Accuracy Scores Found by GridSearchCV

In [50]:
dt_df = pd.DataFrame(dt_grid.cv_results_).sort_values('mean_test_score', 
                                                      ascending=False)[['params', 'mean_test_score']].head(10)
dt_df
Out[50]:
params mean_test_score
596 {'dt__class_weight': None, 'dt__criterion': 'e... 0.077396
594 {'dt__class_weight': None, 'dt__criterion': 'e... 0.077396
630 {'dt__class_weight': None, 'dt__criterion': 'e... 0.077396
612 {'dt__class_weight': None, 'dt__criterion': 'e... 0.077396
614 {'dt__class_weight': None, 'dt__criterion': 'e... 0.077396
632 {'dt__class_weight': None, 'dt__criterion': 'e... 0.077396
600 {'dt__class_weight': None, 'dt__criterion': 'e... 0.077160
620 {'dt__class_weight': None, 'dt__criterion': 'e... 0.077160
638 {'dt__class_weight': None, 'dt__criterion': 'e... 0.077160
602 {'dt__class_weight': None, 'dt__criterion': 'e... 0.077160

From the Dataframe Above, Display the Best Params and Score for the DecisionTree Classifier Model

In [51]:
print('Best Decision Tree Classification Parameters\n' + '='*44)

for name, val in dt_df.iloc[0]['params'].items():
    print('{:>23}: {}'.format(name.replace('dt__', ''), val))
    
dt_acc = dt_df.iloc[0]['mean_test_score']
print('\nAccuracy Score: {}'.format(round(dt_acc, 4)))
Best Decision Tree Classification Parameters
============================================
           class_weight: None
              criterion: entropy
              max_depth: 6
           max_features: auto
       min_samples_leaf: 1
      min_samples_split: 4
               splitter: best

Accuracy Score: 0.0774

Compare the Accuracy Scores of Each Model

Dataframe of the Accuracy Scores of Both Models (Sorted from Largest to Smallest)

In [52]:
acc_scores = [rf_acc, dt_acc]

modelTypes = ['Random Forest Classifier', 'Decision Tree Classifier']

acc_df = pd.DataFrame(zip(modelTypes, acc_scores), columns=['Model Type', 'Accuracy Score'])
acc_df = acc_df.nlargest(len(acc_df), 'Accuracy Score').reset_index(drop=True)

acc_df
Out[52]:
Model Type Accuracy Score
0 Random Forest Classifier 0.110296
1 Decision Tree Classifier 0.077396

From the table above we can see that neither model achieved very good accuracy, but the Random Forest Classifier performed the best, so that is the model I will use to make predictions on the test set for the final analysis and results.

Part 4: Final Model and Analysis Results

Construct Final Model - Random Forest Classifier

Display Best Parameters Found by GridSearchCV for the RandomForest Classifier Model

In [72]:
print('Best Random Forest Classifier Parameters\n' + '='*40)

params = {}

for name, val in rf_df.iloc[0]['params'].items():
    name = name.replace('rf__', '')
    
    params.update({name: val})
    print('{:>21}: {}'.format(name, val))
    
rf_acc = rf_df.iloc[0]['mean_test_score']
print('\nAccuracy Score: {}'.format(round(rf_acc, 4)))
Best Random Forest Classifier Parameters
========================================
         class_weight: balanced
            criterion: entropy
            max_depth: 6
         max_features: sqrt
     min_samples_leaf: 3
    min_samples_split: 3
         n_estimators: 200

Accuracy Score: 0.1103

Create Pipeline for Scaling and Running the Best RandomForest Classifier Model

In [73]:
best_pipe = Pipeline(steps=([
    ('scale', StandardScaler()),
    ('rf', RandomForestClassifier(**params, random_state=seed))
]))

Fit the Model to the Entire Training Dataset

In [74]:
best_model = best_pipe.fit(x_train, y_train)

best_model
Out[74]:
Pipeline(steps=[('scale', StandardScaler()),
                ('rf',
                 RandomForestClassifier(class_weight='balanced',
                                        criterion='entropy', max_depth=6,
                                        max_features='sqrt', min_samples_leaf=3,
                                        min_samples_split=3, n_estimators=200,
                                        random_state=42))])

Use RandomForest Classifier Model to Predict the Hotel Clusters on Test Dataset

In [75]:
y_pred = best_model.predict(x_test)

Compare Accuracy Scores of Train Model and Best Model (with Test Data)

In [76]:
best_model_score = accuracy_score(y_test, y_pred)

print("Best Random Forest Classifier score using the test data\n" + '='*50 +
      "\nTest Accuracy Score: {}\n\nTrain Accuracy Score: {}".format(round(best_model_score, 4), round(rf_acc, 4)))

print('\nDifference between train and best model test accuracy scores: {}'
      .format(abs(round(best_model_score - rf_acc, 4))))
Best Random Forest Classifier score using the test data
==================================================
Test Accuracy Score: 0.1156

Train Accuracy Score: 0.1103

Difference between train and best model test accuracy scores: 0.0053

Since the test accuracy score is so close to the value I received during my training experiments, I am confident that the model I have selected will perform consistently on future, unseen customer hotel data.

Use RandomForest Classifier Model to Predict Hotel Clusters on the Whole Sampled Dataset

In [88]:
originData_PCA_Pred = best_model.predict(bal_PCA_Data[bal_PCA_Data.columns[1:]])

print("Best Random Forest Classifier score using the Whole Sampled Dataset\n" + '='*67 +
      "\nAccuracy Score: {}".format(round(accuracy_score(bal_PCA_Data['hotel_cluster'], originData_PCA_Pred), 4)))
Best Random Forest Classifier score using the Whole Sampled Dataset
===================================================================
Accuracy Score: 0.2666

Analysis Conclusion

The accuracy score for predicting on the entirety of the sampled dataset is roughly double what I got during experimentation, most likely because the model had already seen the 65% of those records that made up the training set. The score is still quite poor and would most likely improve with a larger sample size and a differently tuned model (and a stronger computer to train it). I had originally decided to use XGBoost, but the model training was taking far longer than was feasible for me (even with the small sample size), so I was forced to switch to a different model.
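
For reference, this is roughly how XGBoost would have slotted into the same scale-then-classify pipeline; a minimal sketch only (assuming the xgboost package and its sklearn-compatible XGBClassifier, with illustrative parameter values, since I did not train it to completion):

In [ ]:
# Hypothetical sketch: XGBoost in the same pipeline setup (not run here)
from xgboost import XGBClassifier

xgb_pipe = Pipeline(steps=([
    ('scale', StandardScaler()),
    ('xgb', XGBClassifier(n_estimators=100, max_depth=6, random_state=seed))
]))

# xgb_pipe.fit(x_train, y_train) # training proved too slow to be feasible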

In [ ]: